_IMDb (an acronym for Internet Movie Database)¹ is an online database of information related to films, television programs, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. An additional fan feature, message boards, was abandoned in February 2017. Originally a fan-operated website, the database is now owned and operated by IMDb.com, Inc., a subsidiary of Amazon._
_As of December 2020, IMDb has approximately 7.5 million titles (including episodes) and 10.4 million personalities in its database,² as well as 83 million registered users._
IMDb began as a movie database on the Usenet group "rec.arts.movies" in 1990 and moved to the web in 1993.
We intended to use a dataset of IMDb information to create a machine learning model that takes a movie’s features as input and predicts its rating. This tool would be useful for estimating the popularity of new films based on their features. Based on this aim, we determined our guiding questions, which are given below:
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
# Import the necessary libraries
#data wrangling
import pandas as pd
import numpy as np
import datetime as dt
import glob
import random
#visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import plotly.offline as pyo
# Set notebook mode to work in offline
pyo.init_notebook_mode()
#ignore warnings
import warnings
warnings.filterwarnings("ignore")
IMDb Datasets
Subsets of IMDb data are available for customers to access for personal and non-commercial use. You can hold local copies of this data, subject to IMDb's terms and conditions. Please refer to IMDb's Non-Commercial Licensing and copyright/license terms and verify compliance.
Data Location
The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.
IMDb Dataset Details
Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name. The available datasets are as follows:
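The '\N' null convention maps directly onto pandas' `na_values` option; a minimal sketch using an inline sample rather than a real download:

```python
import io
import pandas as pd

# Inline sample mimicking the IMDb TSV layout; '\N' marks a missing field
sample_tsv = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t1986\n"
    "tt0000002\t\\N\t263\n"
)

# sep="\t" for tab-separated values; na_values turns '\N' into NaN
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values=["\\N"])
print(df["averageRating"].isna().sum())  # 1 missing rating
```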
We converted all files that we will use for our analysis to Parquet format. To save time, we mapped the conversion across files with a thread pool.
def convert_csv_to_parquet(src):
    df = pd.read_csv(src, sep="\t", low_memory=False, na_values=["\\N", "nan"])
    df.to_parquet(src.split(".tsv.gz")[0] + ".parquet", compression='brotli')
%%time
import multiprocessing
from multiprocessing.pool import ThreadPool
import glob
files = glob.glob('*.tsv.gz')
pool = ThreadPool(processes=multiprocessing.cpu_count())
pool.map(convert_csv_to_parquet, files)
df_akas = pd.read_parquet("title.akas.parquet")
df_basics = pd.read_parquet("title.basics.parquet")
df_ratings = pd.read_parquet("title.ratings.parquet")
df_principals = pd.read_parquet("title.principals.parquet")
df_crew = pd.read_parquet("title.crew.parquet")
print("Akas Table:")
display(df_akas.sample())
print("Basics Table:")
display(df_basics.sample())
print("Ratings Table:")
display(df_ratings.sample())
print("Principals Table:")
display(df_principals.sample())
print("Crew Table:")
display(df_crew.sample())
Our cleaning and wrangling activities included:
Renaming:
We changed the 'titleId' column to 'tconst' so that the key column names match.
Merging:
We merged df_akas, df_basics and df_ratings on 'tconst', using outer joins so as not to lose any data. We eliminated df_crew and df_principals from this merge because they are not needed for this part of our analysis.
Limiting:
We want to predict the ratings of movies, so we kept only titles of type 'movie'. Additionally, because there is limited information about movies filmed before 1950, we kept only movies filmed after that year.
df_akas.rename(columns = {'titleId':'tconst'},inplace = True)
# merging
df = pd.merge(df_akas,df_basics,on='tconst',how='outer')
df = pd.merge(df,df_ratings,on='tconst',how='outer')
df.head()
df.info()
df.describe(include=[np.number])
df.describe(include=[object])
df.titleType.value_counts()
# limiting
df_movie = df.loc[df.titleType == 'movie']
df_movie = df_movie.loc[df_movie.startYear > 1950]
df_movie = df_movie[['tconst','title', 'region', 'language', 'types','primaryTitle', 'originalTitle', 'startYear', 'runtimeMinutes', 'genres', 'averageRating', 'numVotes']]
df_movie.sample(3)
df_movie.info()
Because we have many countries, it would be hard to convert all of them to dummy variables, so we decided to use their coordinates instead. We therefore added latitude and longitude information from an openly available CSV file.
df_loc= pd.read_csv("https://raw.githubusercontent.com/cristiroma/countries/master/data/csv/countries.csv",header=None)
df_loc.rename(columns = {0:'countries', 2:'region',4:'Latitude',5:'Longitude'},inplace = True)
df_loc.drop([1,3,6],axis=1,inplace=True)
df_loc.to_parquet("countries.parquet", compression='brotli')
df_movie = pd.merge(df_movie,df_loc,on='region',how='left')
Our target value is 'averageRating', so we should clean all missing values in this column and in the others that we use as predictor features. Examining the missing values:
#1- find&remove duplications
sum(df_movie.duplicated(subset=['title','region','language','types','primaryTitle','originalTitle','startYear','runtimeMinutes','genres','averageRating','numVotes','countries']))
df_movie = df_movie.drop_duplicates(subset=['title','region','language','types','primaryTitle','originalTitle','startYear','runtimeMinutes','genres','averageRating','numVotes','countries'])
df_movie.loc[(df_movie['originalTitle'] == 'The Shawshank Redemption') ]
df_movie.loc[(df_movie['originalTitle'] == 'The Shawshank Redemption') & (df_movie['types'] == 'working')]
df_movie = df_movie.loc[df_movie.types == 'working']
#checking missing values
df_movie.isnull().sum()
#drop missing values
df_movie = df_movie.dropna(subset=['averageRating','Latitude','genres'])
#Distribution of Runtimes of Movies
df_movie.runtimeMinutes = df_movie.runtimeMinutes.astype(float)
ax = sns.boxplot(x=df_movie.runtimeMinutes)
df_movie = df_movie.loc[df_movie.runtimeMinutes < 40000]
ax = sns.boxplot(x=df_movie.runtimeMinutes)
#drop unwanted columns
df_movie = df_movie.drop(['title','region','language','types'],axis=1)
df_movie.isnull().sum()
Because we have many genres, we wanted to simplify them for our analysis. We first looked at the distribution of the top 25 genres, then selected the seven most common; the rest were grouped as 'Other'.
df_movie.genres.value_counts()[:25]
genres_set = {'Drama', 'Comedy', 'Documentary', 'Horror', 'Thriller', 'Action', 'Western'}

def genre_finder(x):
    df_genres = set(x.split(','))
    extract_words = genres_set.intersection(df_genres)
    return ','.join(extract_words)
df_movie['new_genres'] = df_movie.genres.apply(genre_finder)
name = df_movie['new_genres']
df_movie['new_genres'] = [i.split(",")[0].strip() for i in name]
df_movie['new_genres'] = df_movie.new_genres.replace('','Other')
df_movie.new_genres.value_counts()
One of IMDb's greatest features is its exhaustive catalog of cast and crew information. How to include this information in a regression model proved to be one of the bigger challenges in this project. Our solution for capturing this valuable information was to count how many 'big name' actors star in a particular title. To complete this step, we wrote a web scraper to collect the IMDb identifiers ('nconst') from a list of the top 1000 actors. This is of course a subjective evaluation of the best actors/actresses, but the list seems to reasonably cover the 'big name' Hollywood actors of the past century of movies. The analogous procedure was conducted for directors using a list of top directors.
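The counting step itself can be sketched in isolation. The identifiers below are hypothetical placeholders for the scraped top-1000 list; a Python set gives constant-time membership tests, which is much faster than the repeated `str.contains` scans used in our original script:

```python
# Hypothetical nconst identifiers standing in for the scraped top-1000 list
top_actors = {"nm0000001", "nm0000002", "nm0000003"}

def count_top_actors(actor_csv: str) -> int:
    """Count how many of a title's actors appear in the top-actor set."""
    return sum(a in top_actors for a in actor_csv.split(","))

# A title with two 'big name' actors and one unknown actor
print(count_top_actors("nm0000001,nm0000003,nm9999999"))  # 2
```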
df_actors = df_principals.loc[(df_principals.category == 'actor') | (df_principals.category == 'actress')]
df_actors = df_actors[['tconst', 'nconst']]
df_actors = pd.DataFrame(df_actors.groupby('tconst')['nconst'].apply(lambda x: ','.join(x)))
df_actors.sample()
main_df = pd.merge(df_actors,df_crew,on='tconst',how='outer')
main_df = main_df.rename(columns={'nconst':'actors'})
main_df = main_df.drop('writers',axis=1)
main_df.sample(3)
# # The following code block has been commented out because it was run as a separate
# # script and its result has been saved into a separate file. Code is included for posterity.
# from bs4 import BeautifulSoup
# import requests
# # Top actors list
# urls = [
# "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=1",
# "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=2",
# "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=3",
# "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=4",
# "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=5",
# "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=6",
# "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=7",
# "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=8",
# "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=9",
# "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=10"
# ]
# pages = [requests.get(url) for url in urls]
# soups = [BeautifulSoup(page.content, 'html.parser') for page in pages]
# actors = {}
# for soup in soups:
# for item in soup.find_all('h3', class_='lister-item-header'):
# actor = item.a.text.lstrip(' ').rstrip('\n')
# nmconst = item.a['href'].split('/name/')[1]
# actors[actor] = nmconst
# # Top directors list
# urls = ["https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=1",
# "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=2",
# "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=3",
# "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=4",
# "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=5",
# "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=6",
# "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=7",
# "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=8",
# "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=9",
# "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=10",
# ]
# pages = [requests.get(url) for url in urls]
# soups = [BeautifulSoup(page.content, 'html.parser') for page in pages]
# directors = {}
# for soup in soups:
# for item in soup.find_all('h3', class_='lister-item-header'):
# director = item.a.text.lstrip(' ').rstrip('\n')
# nmconst = item.a['href'].split('/name/')[1]
# directors[director] = nmconst
# actors_df = pd.DataFrame.from_dict(actors, orient = 'index', columns = ['nmconst'])
# directors_df = pd.DataFrame.from_dict(directors, orient = 'index', columns = ['nmconst'])
# actors_df.head()
# actors_df.to_csv('top1000actors.csv')
# directors_df.to_csv('top1000directors.csv')
# def count_top_actors(actorlist):
# # Function that takes a list of actor nconst's and returns a count of their occurrence in the top 1000 list
# actor_count = 0
# for actor in actorlist:
# #if actor in topactors['nconst']:
# if topactors['nconst'].str.contains(actor).any():
# actor_count += 1
# return actor_count
# def top_director(director):
# if topdirectors['nconst'].str.contains(director).any():
# return 1
# else:
# return 0
# def custom_map(data_split):
# data_split['topactors'] = data_split['actorlist'].map(lambda x : count_top_actors(x), na_action = 'ignore')
# data_split['topdirector'] = data_split['directors'].map(lambda x: top_director(x), na_action = 'ignore')
# return data_split
# cores = 14
# partitions = cores
# def parallelize(data, func):
# data_split = np.array_split(data, partitions)
# pool = Pool(cores)
# data = pd.concat(pool.map(func, data_split))
# pool.close()
# pool.join()
# return data
# main_df = pd.read_csv('titles and actors.csv')
# main_df.head()
# topactors = pd.read_csv('top1000actors.csv')
# topdirectors = pd.read_csv('top1000directors.csv')
# topactors.head()
# main_df.drop(columns = 'Unnamed: 0', inplace = True)
# main_df.head()
# main_df['actorlist'] = main_df['actors'].map(lambda x: x.split(','), na_action = 'ignore')
# main_df.head()
# import ipynb
# %%time
# actorcounts = parallelize(main_df, custom_map)
# actorcounts.head()
# plt.hist(actorcounts.topactors, bins =10)
# plt.show()
# actorcounts.to_csv('actorcounts.csv')
# Converting the top 1000 actors and top 1000 directors lists to parquet files
def convert_csv_to_parquet(src):
    df = pd.read_csv(src)
    df.to_parquet(src.split(".csv")[0] + ".parquet", compression='brotli')
%%time
import multiprocessing
from multiprocessing.pool import ThreadPool
import glob
files = glob.glob('*.csv')
pool = ThreadPool(processes=multiprocessing.cpu_count())
pool.map(convert_csv_to_parquet, files)
topactors = pd.read_parquet('top1000actors.parquet')
topdirectors = pd.read_parquet('top1000directors.parquet')
print("Top 1000 Actors:")
display(topactors.sample())
print("Top 1000 Directors:")
display(topdirectors.sample())
#actorcounts.to_csv('actorcounts.csv')
#actorcounts.to_parquet('actorcounts.parquet') # already produced by the conversion loop above
actorcounts = pd.read_parquet('actorcounts.parquet')
display(actorcounts.sample(3))
df_movie = pd.merge(df_movie,actorcounts,on='tconst',how='left')
df_movie.drop(['Unnamed: 0','actors','directors','actorlist'],axis=1,inplace=True)
df_movie.sample()
df_movie = df_movie.fillna(0)
df_movie.isnull().sum()
df_movie.topactors.value_counts()
df_movie.topdirector.value_counts()
We have around 18,500 movies in the data frame. Before moving to the machine learning part, we wanted to understand the behaviour of each feature, so we first looked at the relationships among the features and the distribution of each. Our findings are visualized in the graphs below.
To understand the relationship between the target and each predictor feature, we divided our target variable into 10 quantile bins. Looking at the average rating row, there is a clear relationship between the number of votes and the average rating. Therefore, we considered the number of votes an important feature in our analysis.
nbins = 10
import pandas as pd
df_movie["RatingCat"] = pd.qcut(df_movie["averageRating"], q=nbins, labels=False)
# plot the pairwise scatterplot
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
print("Pairwise Scatterplots of Features")
sns.set(style="ticks")
sns.pairplot(df_movie, hue="RatingCat", palette="RdBu_r")
plt.show()
As expected, the number of movies increases over the years; only 2020 shows a decrease, possibly because of the pandemic. Average rating has a fairly normal distribution; runtime also looks roughly normal but still has some outliers. The number of votes has many values near zero, which may cause a problem in our analysis.
df_movie = df_movie.drop(['RatingCat'],axis=1)
import matplotlib
fig = plt.figure(constrained_layout=True, figsize=(20,10))
gs = matplotlib.gridspec.GridSpec(ncols=2, nrows=2, figure=fig)
df_sns = df_movie[['startYear', 'runtimeMinutes', 'averageRating', 'numVotes']]
print("Histograms")
for i, column in enumerate(df_sns.columns):
if df_sns[column].dtype.kind not in 'bifc': continue
ax = fig.add_subplot( gs[i//2,i%2])
sns.distplot(df_sns[column],ax=ax).set_title(column)
We already examined the start year of the movies above. Looking at the genre distribution, the majority of movies are dramas, followed by comedies and action films.
fig = plt.figure(constrained_layout=True, figsize=(20,20))
gs = matplotlib.gridspec.GridSpec(ncols=2, nrows=2, figure=fig)
df_sns = df_movie[['startYear', 'new_genres','topactors','topdirector']]
print("Frequencies")
for i, column in enumerate(df_sns.columns):
ax = fig.add_subplot( gs[i//2,i%2])
sns.countplot(df_sns[column],ax=ax).set_title(column)
ax.tick_params(labelrotation=90)
For a better understanding of the behaviour of the features, we thought that examining the top-rated and bottom-rated movies could be useful. For that reason, we limited the data to movies with at least 100,000 votes, then chose the top hundred and the bottom hundred movies.
df_movie = df_movie.sort_values("averageRating",ascending=False)
df_t = df_movie.loc[df_movie.numVotes >= 100000]
df_top = df_t.head(100)
df_bottom = df_t.tail(100)
print("Top Ten")
display(df_top.head(10))
print("Bottom Ten")
display(df_bottom.head(10))
We repeated the above processes using these subsets of our data. Here we used 5 bins of runtime as a category. Again, there is a positive relationship between average rating and number of votes. Additionally, we can see a relationship between runtime and average rating, especially among the bottom movies.
nbins = 5
df_top["RuntimeCat"] = pd.qcut(df_top["runtimeMinutes"], q=nbins, labels=False)
# plot the pairwise scatterplot
print("Pairwise Scatterplots of Top Hundred Movies")
sns.set(style="ticks")
sns.pairplot(df_top, hue="RuntimeCat", palette="RdBu_r",vars=["startYear", "runtimeMinutes", "averageRating","numVotes"])
plt.show()
df_bottom["RuntimeCat"] = pd.qcut(df_bottom["runtimeMinutes"], q=nbins, labels=False)
print("Pairwise Scatterplots of Bottom Hundred Movies")
sns.set(style="ticks")
sns.pairplot(df_bottom, hue="RuntimeCat", palette="RdBu_r",vars=["startYear", "runtimeMinutes", "averageRating","numVotes"])
plt.show()
As seen in the average ratings graph, light orange shows bottom movies while light blue shows top movies. Newly released movies are generally at the bottom of the list, and the runtimes of bottom-list movies are also lower than the others. While the number of votes for top movies has a nearly normal distribution, the bottom movies have a right-skewed distribution. There is no clear pattern in start year for either list. And while drama takes first place in the top list, interestingly action is first in the bottom list.
fig = plt.figure(constrained_layout=True, figsize=(20,10))
gs = matplotlib.gridspec.GridSpec(ncols=2, nrows=2, figure=fig)
df_sns = df_top[['startYear', 'runtimeMinutes', 'averageRating', 'numVotes']]
print("Histograms for top & bottom hundred movies")
for i, column in enumerate(df_sns.columns):
if df_sns[column].dtype.kind not in 'bifc': continue
ax = fig.add_subplot( gs[i//2,i%2])
sns.distplot(df_sns[column],ax=ax).set_title(column)
df_sns = df_bottom[['startYear', 'runtimeMinutes', 'averageRating', 'numVotes']]
for i, column in enumerate(df_sns.columns):
if df_sns[column].dtype.kind not in 'bifc': continue
ax = fig.add_subplot( gs[i//2,i%2])
sns.distplot(df_sns[column],ax=ax).set_title(column)
fig = plt.figure(constrained_layout=True, figsize=(20,20))
gs = matplotlib.gridspec.GridSpec(ncols=2, nrows=2, figure=fig)
df_sns = df_top[['startYear', 'new_genres','topactors','topdirector']]
print("Frequencies")
for i, column in enumerate(df_sns.columns):
ax = fig.add_subplot( gs[i//2,i%2])
sns.countplot(df_sns[column],ax=ax).set_title(column)
ax.tick_params(labelrotation=90)
fig = plt.figure(constrained_layout=True, figsize=(20,20))
gs = matplotlib.gridspec.GridSpec(ncols=2, nrows=2, figure=fig)
df_sns = df_bottom[['startYear', 'new_genres','topactors','topdirector']]
print("Frequencies")
for i, column in enumerate(df_sns.columns):
ax = fig.add_subplot( gs[i//2,i%2])
sns.countplot(df_sns[column],ax=ax).set_title(column)
ax.tick_params(labelrotation=90)
After examining the subsets of our data, we also wanted to see the geographic distribution of average ratings for the whole data set. From the choropleth map, we can see that the average rating of movies in Kazakhstan is the lowest, while that of movies in Azerbaijan is the highest. These extremes may simply be due to the limited number of movies from those two countries.
df_map = df_movie.groupby(['countries']).agg({'tconst':"count", 'averageRating':"mean"})
df_map.reset_index(inplace = True)
#df_map = df_map.sort_values(by=['startYear'])
#df_map.startYear = df_map.startYear.astype(int)
fig = px.choropleth(df_map, locations="countries", # used plotly express choropleth for animation plot
color="averageRating",
locationmode='country names',
hover_name="countries",
hover_data=['tconst'],
#animation_frame =df_map.startYear,
title = 'Location Distributions of Average Ratings 1951 - 2021')
# adjusting size of map, legend place, and background colour
fig.update_layout(
autosize=False,
width=1000,
height=500,
margin=dict(
l=50,
r=50,
b=100,
t=100,
pad=4
),
template='seaborn',
#paper_bgcolor="rgb(234, 234, 242)",
legend=dict(
orientation="v",
yanchor="auto",
y=1.02,
xanchor="right",
x=1
))
fig.show()
# reference: https://plotly.github.io/plotly.py-docs/generated/plotly.express.choropleth.html
It is clear that there is no high correlation between the target variable and the other features. The highest correlation is between top directors and top actors.
plt.figure(figsize=(10,10))
corrMatrix = df_movie.corr()
sns.heatmap(corrMatrix, annot=True,vmin=-1, vmax=1, center=0,
cmap=sns.diverging_palette(20, 220, n=200),
square=True
)
plt.show()
#to work on Talc
df_movie.to_parquet('df_movie')
In this section, we worked on TALC. We tried several regression models to obtain the best predictions. Each model is listed below, and we also tried different methodologies on each model to increase accuracy. Because we have many movies with nearly zero votes, we realized that these values reduced our accuracy. After we limited the data to a reasonable minimum number of votes, our predictions improved, so we used this data set for the other models as well. To compare models fairly, we used an 80/20 train-test split and a random seed of 42 for each model.
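As a reference for the comparisons below, RMSE and R² can be computed by hand; a small sketch with made-up ratings and predictions (not actual model output):

```python
import math

# Made-up true ratings and predictions, for illustration only
y_true = [7.0, 6.5, 8.2, 5.9]
y_pred = [6.8, 6.9, 7.9, 6.1]

n = len(y_true)
mean_y = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
rmse = math.sqrt(ss_res / n)
r2 = 1 - ss_res / ss_tot
print(round(rmse, 3), round(r2, 3))  # 0.287 0.885
```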
The models:
Spark cluster on TALC:
# kill previous sparkcluster jobs just in case
try: sc.stop()
except: pass
try: sj.stop()
except: pass
! scancel -u `whoami` -n sparkcluster
import os
import atexit
import sys
import time
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
import findspark
from sparkhpc import sparkjob
#Exit handler to clean up the Spark cluster if the script exits or crashes
def exitHandler(sj,sc):
try:
print('Trapped Exit cleaning up Spark Context')
sc.stop()
except:
pass
try:
print('Trapped Exit cleaning up Spark Job')
sj.stop()
except:
pass
findspark.init()
# Parameters for the Spark cluster
# At present, we have to reserve an entire node at a time (this is due to an update on the system, and this will be
# addressed in the future)
nodes=1
tasks_per_node=24
memory_per_task=10000
# Please estimate walltime carefully to keep unused Spark clusters from sitting
# idle so that others may use the shared resources.
walltime="3:00" # hh:mm, three hours
os.environ['SBATCH_PARTITION']='cpu24' #Set the appropriate TALC partition
sj = sparkjob.sparkjob(
ncores=nodes*tasks_per_node,
cores_per_executor=tasks_per_node,
memory_per_core=memory_per_task,
memory_per_executor=memory_per_task-500,
walltime=walltime
)
sj.wait_to_start()
time.sleep(60)
sc = sj.start_spark()
#Register the exit handler
atexit.register(exitHandler,sj,sc)
#You need this line if you want to use SparkSQL
scq=SQLContext(sc)
display(sc)
#loading parquet file to work on spark
df_movie = scq.read.parquet('df_movie')
df_movie = df_movie.cache()
#getting dummies for genres
from pyspark.sql.functions import when
genre_cols = ["Drama", "Comedy", "Documentary", "Action", "Other", "Thriller", "Horror", "Western"]
df_with_extra_columns = df_movie
for genre in genre_cols:
    df_with_extra_columns = df_with_extra_columns.withColumn(genre, when(df_with_extra_columns.new_genres == genre, 1))
df_with_extra_columns = df_with_extra_columns.na.fill(value=0)
df_with_extra_columns.toPandas().info()
df = df_with_extra_columns[[ "startYear", "runtimeMinutes", "numVotes", "Latitude","Longitude","topactors","topdirector","Drama","Comedy","Documentary","Action","Other","Thriller","Horror","Western","averageRating"]]
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
features = ["startYear", "runtimeMinutes", "numVotes", "Latitude","Longitude","Drama","Comedy","Documentary","Action","Other","Thriller","Horror","Western"]
v_asm = VectorAssembler(inputCols=features, outputCol="features")
ml_df1 = v_asm.transform(df_with_extra_columns.select([ "startYear", "runtimeMinutes", "numVotes", "Latitude","Longitude","Drama","Comedy","Documentary","Action","Other","Thriller","Horror","Western","averageRating"])).cache()
ml_df1.show(3, truncate=False)
%%time
# code below was modified from https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
# Load and parse the data file, converting it to a DataFrame.
data = ml_df1
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (20% held out for testing)
(trainingData, testData) = data.randomSplit([0.8, 0.2],seed=42)
# Train a RandomForest model.
rf = RandomForestRegressor(featuresCol="indexedFeatures", labelCol='averageRating')
# Chain indexer and forest in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, rf])
# Train model. This also runs the indexer.
model = pipeline.fit(trainingData)
predictions0 = model.transform(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
print("Predictions on train data:")
predictions0.select("prediction", "averageRating", "features").show(5)
print("Predictions on test data:")
predictions.select("prediction", "averageRating", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(labelCol="averageRating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
print("R^2 on test data = %g" % r2)
print( model.stages[1] ) # summary only
features_cols = df.columns
features_cols.remove('averageRating')
vec= VectorAssembler(inputCols=features_cols, outputCol='features')
df = vec.transform(df)
ml_ready_df = df.select(['averageRating','features'])
ml_ready_df.show(5)
# Train a RandomForest model.
data = ml_ready_df
rf = RandomForestRegressor(featuresCol="features", labelCol='averageRating', numTrees=25, maxDepth=20)
(trainingData, testData) = data.randomSplit([0.8, 0.2],seed=42)
model = rf.fit(trainingData)
predictions = model.transform(testData)
predictions.select("prediction", "averageRating").show(5)
evaluator = RegressionEvaluator(labelCol="averageRating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
print("RMSE: " + str(rmse))
print("R^2: " + str(r2))
import pandas as pd
predictions_df = predictions.toPandas()
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.plot(predictions_df.averageRating, predictions_df.prediction, 'bo',alpha=.5)
plt.xlabel('averageRating')
plt.ylabel('Prediction')
plt.suptitle("Model Performance RMSE: %f" % rmse)
plt.show()
import pandas as pd
# Convert feature importances to a pandas column
fi_df = pd.DataFrame(model.featureImportances.toArray(),
columns=['importance'])
fi_df['feature'] =pd.Series(features_cols)
fi_df.sort_values(by=['importance'],ascending=False,inplace=True)
fi_df
plt.style.use('ggplot')
plt.bar(fi_df.feature, fi_df.importance, orientation = 'vertical', alpha=.5)
plt.xticks(rotation=90)
plt.ylabel('Importance')
plt.xlabel('Feature')
plt.title('Feature Importances')
plt.show()
df9 = df_with_extra_columns.filter(df_with_extra_columns.numVotes >= 100000)
df = df9[[ "startYear", "runtimeMinutes", "numVotes", "Latitude","Longitude","topactors","topdirector","Drama","Comedy","Documentary","Action","Other","Thriller","Horror","Western","averageRating"]]
features_cols = df.columns
features_cols.remove('averageRating')
vec= VectorAssembler(inputCols=features_cols, outputCol='features')
df = vec.transform(df)
ml_ready_df = df.select(['averageRating','features'])
ml_ready_df.show(5)
# Train a RandomForest model.
data = ml_ready_df
rf = RandomForestRegressor(featuresCol="features", labelCol='averageRating', numTrees=65, maxDepth=30)
(trainingData, testData) = data.randomSplit([0.8, 0.2],seed=42)
model = rf.fit(trainingData)
predictions = model.transform(testData)
predictions.select("prediction", "averageRating").show(5)
evaluator = RegressionEvaluator(labelCol="averageRating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
print("RMSE: " + str(rmse))
print("R^2: " + str(r2))
import pandas as pd
predictions_df = predictions.toPandas()
plt.style.use('ggplot')
plt.plot(predictions_df.averageRating, predictions_df.prediction, 'bo', alpha=.5)
plt.xlabel('averageRating')
plt.ylabel('Prediction')
plt.suptitle("Model Performance RMSE: %f" % rmse)
plt.show()
model = rf.fit(trainingData)
predictions = model.transform(ml_ready_df)
predictions.select("prediction", "averageRating").show(5)
predictions_df = predictions.toPandas()
plt.style.use('ggplot')
plt.plot(predictions_df.averageRating, predictions_df.prediction, 'bo', alpha=.5)
plt.xlabel('averageRating')
plt.ylabel('Prediction')
plt.suptitle("Model Performance on the whole data set")
plt.show()
import pandas as pd
# Convert feature importances to a pandas column
fi_df = pd.DataFrame(model.featureImportances.toArray(),
columns=['importance'])
fi_df['feature'] =pd.Series(features_cols)
fi_df.sort_values(by=['importance'],ascending=False,inplace=True)
fi_df
plt.style.use('ggplot')
plt.bar(fi_df.feature, fi_df.importance, orientation = 'vertical',alpha=.5)
plt.xticks(rotation=90)
plt.ylabel('Importance')
plt.xlabel('Feature')
plt.title('Feature Importances')
plt.show()
df9 = df_with_extra_columns.filter(df_with_extra_columns.numVotes >= 100000)
df = df9[[ "startYear", "runtimeMinutes", "numVotes", "Latitude","Longitude","topactors","topdirector","Drama","Comedy","Documentary","Action","Other","Thriller","Horror","Western","averageRating"]]
# reference: https://runawayhorse001.github.io/LearningApacheSpark/regression.html
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
# Build the features and labels from the DataFrame
# (this rdd-based approach is convenient when there are many feature columns):
def transData(data):
return data.rdd.map(lambda r: [Vectors.dense(r[:-1]),r[-1]]).toDF(['features','label'])
transformed= transData(df)
transformed.show(5)
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features", \
outputCol="indexedFeatures",\
maxCategories=4).fit(transformed)
data = featureIndexer.transform(transformed)
data.show(5,True)
#reference: https://spark.apache.org/docs/latest/ml-tuning.html
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
# Prepare training and test data.
train, test = data.randomSplit([0.8, 0.2], seed=42)
lr = LinearRegression(maxIter=10)
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine best model using
# the evaluator.
paramGrid = ParamGridBuilder()\
.addGrid(lr.regParam, [0.1, 0.005]) \
.addGrid(lr.fitIntercept, [False, True])\
.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
.build()
# In this case the estimator is simply the linear regression.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
tvs = TrainValidationSplit(estimator=lr,
estimatorParamMaps=paramGrid,
evaluator=RegressionEvaluator(),
# 50% of the training data will be used for fitting, 50% for validation.
trainRatio=0.5)
# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(train)
# Make predictions on test data. model is the model with combination of parameters
# that performed best.
model.transform(test)\
.select("features", "label", "prediction")\
.show(5)
# Make predictions.
predictions = model.transform(test)
# Select example rows to display.
predictions.select("features","label","prediction").show(5)
from pyspark.ml.evaluation import RegressionEvaluator
# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(labelCol="label",
predictionCol="prediction",
metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
y_true = predictions.select("label").toPandas()
y_pred = predictions.select("prediction").toPandas()
import sklearn.metrics
r2_score = sklearn.metrics.r2_score(y_true, y_pred)
print('r2_score: {0}'.format(r2_score))
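Note that Spark's RegressionEvaluator can also compute R² directly (with metricName="r2"), so the round-trip through pandas and scikit-learn is optional. As a sanity check on the metrics themselves, RMSE and R² can be computed from the prediction arrays with plain NumPy; a minimal sketch with toy values:

```python
import numpy as np

# Toy true labels and predictions (hypothetical values)
y_true = np.array([7.0, 8.0, 6.0, 9.0])
y_pred = np.array([6.5, 8.5, 6.0, 8.0])

# RMSE: root of the mean squared residual
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)  # RMSE ~ 0.612, R^2 = 0.7
```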
# Import the GeneralizedLinearRegression class
from pyspark.ml.regression import GeneralizedLinearRegression
# Define the generalized linear regression algorithm
glr = GeneralizedLinearRegression(family="gaussian", link="identity",\
maxIter=10, regParam=0.3)
# Chain indexer and tree in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, glr])
model = pipeline.fit(train)
# Make predictions.
predictions = model.transform(test)
from pyspark.ml.evaluation import RegressionEvaluator
# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(labelCol="label",
predictionCol="prediction",
metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
y_true = predictions.select("label").toPandas()
y_pred = predictions.select("prediction").toPandas()
import sklearn.metrics
r2_score = sklearn.metrics.r2_score(y_true, y_pred)
print('r2_score: {0}'.format(r2_score))
# Import the GBTRegressor class
from pyspark.ml.regression import GBTRegressor
# Define the gradient-boosted tree regression algorithm
gbt = GBTRegressor()  # e.g. maxIter=10, maxDepth=5, seed=42
# Chain indexer and GBT in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, gbt])
model = pipeline.fit(train)
predictions = model.transform(test)
# Select example rows to display.
predictions.select("features","label", "prediction").show(5)
# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
y_true = predictions.select("label").toPandas()
y_pred = predictions.select("prediction").toPandas()
import sklearn.metrics
r2_score = sklearn.metrics.r2_score(y_true, y_pred)
print('r2_score: {:4.3f}'.format(r2_score))
Considering our guiding questions in turn:
In the first step, we tried to identify which features can help predict ratings. The features available directly from IMDb are title, region, language, type, primary title, original title, start year, runtime, genres, average rating, number of votes, and country information. We first grouped some of these for elimination: type was used only as a limiting factor on the data set; region, language, and country are all location information, so we decided to represent them with the latitude and longitude of each country; and we chose not to use the title fields in our analysis. We also had actor and director information, which we converted into counts of top actors and top directors per movie. So, before running the machine learning models, we had these features: start year, runtime, genres, number of votes, latitude, longitude, top actors, and top directors, with average rating as the target.
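The genre flags and top-actor counts described above can be sketched with pandas on a toy sample; the column names mirror the ones used in the model, but the rows, the actor names, and the `top_actors` set are hypothetical:

```python
import pandas as pd

# Hypothetical sample rows; "genres" and "actors" stand in for the IMDb fields
df = pd.DataFrame({
    "genres": ["Drama,Comedy", "Action", "Drama,Thriller"],
    "actors": [["actorA", "actorB"], ["actorC"], ["actorA"]],
})

top_actors = {"actorA", "actorB"}  # assumed list of top actors

# One-hot genre flags, as in the Drama/Comedy/... columns used for the model
for genre in ["Drama", "Comedy", "Thriller", "Action"]:
    df[genre] = df["genres"].str.contains(genre).astype(int)

# Number of top actors appearing in each movie
df["topactors"] = df["actors"].apply(lambda cast: sum(a in top_actors for a in cast))
print(df[["Drama", "Action", "topactors"]])
```

The same idea extends to top directors, and in the Spark pipeline these counts become ordinary numeric feature columns.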
After choosing the features, we performed exploratory data analysis to examine each feature's behaviour. When we drew scatter plots, we noticed that the number of votes and the rating appear to have a positive relation. The correlation matrix showed the same pattern, so we guessed that the number of votes could be the most important feature for our analysis. We also looked at the individual distributions: drama is the most common genre, and the number of movies roughly increases year by year.
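The correlation check can be sketched with `DataFrame.corr()` on synthetic stand-in data; the numbers below are not the real IMDb data, just a toy set constructed so that ratings are loosely tied to vote counts:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical synthetic features: log-uniform vote counts from 100 to 1,000,000,
# with ratings partly driven by log10(numVotes) plus noise
num_votes = (10 ** rng.uniform(2, 6, size=200)).astype(int)
rating = 0.5 * np.log10(num_votes) + rng.normal(0, 0.2, size=200)
df = pd.DataFrame({"numVotes": num_votes,
                   "runtimeMinutes": rng.integers(60, 180, size=200),
                   "averageRating": rating})

# Pairwise Pearson correlations, as in the matrix used in the analysis;
# numVotes should show the strongest positive correlation with the rating
corr = df.corr()
print(corr["averageRating"].sort_values(ascending=False))
```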
We thought that the top-rated and bottom-rated movies could give clues for our analysis, so we decided to look at them closely, and we found that they behave quite differently. The runtime of bottom movies is shorter than that of top movies, and the top movies have a far higher number of votes. Top movies are also generally older. Interestingly, while the majority of top movies are dramas, the majority of bottom movies are action films. We also expected top actors and directors to appear mostly in top movies; however, their distributions are quite similar across top and bottom movies.
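This kind of band comparison can be sketched by cutting the ratings into bands and averaging the features per band; the rows below are hypothetical, and the band boundaries are illustrative, not the cutoffs used in the analysis:

```python
import pandas as pd

# Hypothetical sample: three highly rated and three poorly rated movies
df = pd.DataFrame({
    "averageRating": [9.2, 8.8, 8.5, 2.1, 2.8, 3.0],
    "runtimeMinutes": [150, 140, 130, 85, 90, 95],
    "numVotes": [900_000, 750_000, 600_000, 3_000, 5_000, 4_000],
})

# Split into rating bands and compare mean feature values per band
df["band"] = pd.cut(df["averageRating"], bins=[0, 4, 8, 10],
                    labels=["bottom", "middle", "top"])
print(df.groupby("band", observed=True)[["runtimeMinutes", "numVotes"]].mean())
```

On this toy sample the top band has longer runtimes and far more votes than the bottom band, mirroring the pattern described above.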
After investigating our guiding questions, we built models based on these answers and obtained the results below. As the table shows, random forest gives us the best model. During exploratory data analysis we saw that the number of votes has a strongly right-skewed distribution, meaning many movies have very few votes, so we also tried our models on a data set limited to movies with a higher number of votes; this gave far better results for random forest and for the other models. Even when we trained our best model on the whole data set, its accuracy was better than that of the other models.
| Model Name | RMSE | $R^2$ |
|---|---|---|
| Random Forest Regression (with the whole data) | 1.05 | 0.31 |
| Random Forest Regression (with best hyperparameters) | 0.89 | 0.48 |
| Random Forest Regression (with higher number of Votes) | 0.37 | 0.82 |
| Linear Regression (with higher number of Votes) | 0.58 | 0.51 |
| Generalized Linear Regression (with higher number of Votes) | 0.61 | 0.46 |
| Gradient-boosted tree Regression (with higher number of Votes) | 0.49 | 0.46 |
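A comparison of this shape can also be sketched outside Spark with the scikit-learn equivalents of these models; the synthetic data, parameters, and scores below are illustrative stand-ins, not the results in the table:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data; the real comparison uses the Spark models above
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Random Forest": RandomForestRegressor(random_state=42),
    "Linear Regression": LinearRegression(),
    "Gradient-boosted Trees": GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"{name}: RMSE={rmse:.2f}, R^2={r2_score(y_te, pred):.2f}")
```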
We learned a lot while doing this project. Dealing with such a big data set was the biggest challenge. Also, even though our data source is quite popular, the data was very dirty, and a large part of our work went into cleaning it. Finally, although we tried many features and models, more could still be attempted: the analysis could be improved by adding new features such as text analysis of titles, different genre lists, and other movie attributes, and by running more advanced machine learning models such as neural networks.
[1] Learning Apache Spark with Python documentation. (n.d.). Retrieved March 27, 2021, from https://runawayhorse001.github.io/LearningApacheSpark/index.html
[2] MLlib: Main Guide - Spark 3.1.1 Documentation. (n.d.). Retrieved March 27, 2021, from https://spark.apache.org/docs/latest/ml-guide.html
[3] Wenig, B., Damji, J. S., Das, T., & Lee, D. (2020). Learning Spark (1st ed., Vol. 1). Van Duuren Media.
[4] M. Marović, M. Mihoković, M. Mikša, S. Pribil and A. Tus, "Automatic movie ratings prediction using machine learning," 2011 Proceedings of the 34th International Convention MIPRO, Opatija, Croatia, 2011, pp. 1640-1645.